feat: update github actions workflow #9

rabioinf · 2025-01-22T09:48:13Z

Summary by CodeRabbit

New Features
- Updated GitHub Actions workflows with improved configuration and version upgrades
- Added comprehensive configuration files for genomic data processing workflow
- Introduced new Snakemake workflow for genome data analysis
Documentation
- Significantly enhanced README files with detailed workflow instructions
- Added JSON schemas for configuration validation
- Improved inline documentation and workflow descriptions
Chores
- Updated dependency versions across multiple configuration files
- Added new environment configurations for various bioinformatics tools
- Introduced new workflow scripts and rules

feat: complete minimal workflow as template

coderabbitai · 2025-01-22T09:48:22Z

Walkthrough

This pull request introduces a comprehensive update to a Snakemake-based genomic data processing workflow. The changes span multiple configuration files, GitHub Actions workflows, and workflow scripts. Key modifications include updating workflow configurations, adding new environment definitions, introducing schema validations, and enhancing documentation. The workflow now supports genome data retrieval, read simulation, quality control, adapter trimming, and multi-sample report generation with improved modularity and configuration management.

Changes

File	Change Summary
`.github/workflows/*`	Updated workflow configurations, including PR validation, CI/CD triggers, and action versions
`.gitignore`	Added `.test/results/*` to ignored files
`config/`	Added comprehensive configuration files for workflow parameters, schemas, and MultiQC settings
`README.md`	Significantly expanded documentation with badges, workflow overview, and execution instructions
`workflow/Snakefile`	Added configuration loading, rule inclusions, and workflow target definition
`workflow/envs/`	Created environment files for tools like cutadapt, fastqc, multiqc, and genome retrieval
`workflow/rules/`	Added modular rules for genome processing, read simulation, and quality control
`workflow/scripts/get_genome.py`	New script for genome data retrieval and validation

Sequence Diagram

sequenceDiagram
    participant User
    participant Workflow
    participant NCBI
    participant Genome
    participant Reads
    participant QC
    participant Trimming
    participant Report

    User->>Workflow: Configure parameters
    Workflow->>NCBI: Retrieve genome data
    NCBI-->>Genome: Download FASTA/GFF
    Workflow->>Reads: Simulate sequencing reads
    Reads->>QC: Perform quality check
    QC->>Trimming: Trim adapters
    Trimming->>Report: Generate MultiQC report
    Report-->>User: Workflow complete

Poem

🐰 A Snakemake tale of genomic delight,
Workflows dancing with scientific might!
From NCBI's realm to reads so bright,
Adapters trimmed, quality shining light,
MultiQC reports, a researcher's delight! 🧬

✨ Finishing Touches

📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 6

♻️ Duplicate comments (1)

config/config.yml (1)

8-14: ⚠️ Potential issue

Fix duplicate keys in gff_source_type configuration.

Same issue as in test config - duplicate "RefSeq" keys will cause data loss.

🧹 Nitpick comments (12)

workflow/Snakefile (1)
32-33: Enhance error handling in the onerror hook.

The error message could be more informative by including the error details.
-    print("--- An error occurred! ---")
+    print("--- An error occurred! ---")
+    print(f"Error: {workflow.logger.get_error_message()}")
+    print("Check the log files in the results directory for more details.")
workflow/rules/process_reads.smk (1)
90-90: Consider adjusting thread allocation for cutadapt.

Using 25% of total cores might be excessive for smaller files. Consider setting a maximum limit.
-    threads: int(workflow.cores * 0.25)
+    threads: min(8, max(1, int(workflow.cores * 0.25)))
workflow/scripts/get_genome.py (1)
61-61: Improve code style and readability.

Several style improvements can be made:

Use not in instead of not ... in

Simplify dict key checks
-                if not "Name" in rec_keys:
+                if "Name" not in rec_keys:

-                if not "ID" in rec_keys:
+                if "ID" not in rec_keys:

-        for k in ncbi_genome.keys():
+        for k in ncbi_genome:
Also applies to: 72-72, 107-107

🧰 Tools

🪛 Ruff (0.8.2)

61-61: Test for membership should be not in

Convert to not in

(E713)
.test/config/multiqc_config.yml (1)

2-2: Add newline at end of file.

Add a newline character at the end of the file to comply with YAML best practices.

🧰 Tools

🪛 yamllint (1.35.1)

[error] 2-2: no new line character at the end of file

(new-line-at-end-of-file)
config/schemas/samples.schema.yml (1)
1-25: Consider adding additional schema validations.

The schema is well-structured, but could benefit from additional validations:

Pattern validation for sample names (e.g., no spaces/special chars)

Minimum value for replicate (e.g., >= 1)

Pattern validation for read files (e.g., must end in .fastq.gz)

Example additions:
   sample:
     type: string
     description: sample name/identifier
+    pattern: "^[A-Za-z0-9_-]+$"
   replicate:
     type: number
     default: 1
     description: consecutive numbers representing multiple replicates of one condition
+    minimum: 1
   read1:
     type: string
     description: names of fastq.gz files, read 1
+    pattern: ".*\\.fastq\\.gz$"
   read2:
     type: string
     description: names of fastq.gz files, read 2 (optional)
+    pattern: ".*\\.fastq\\.gz$"
config/schemas/config.schema.yml (3)
8-20: Add type and required field definitions for the get_genome object.

The schema for get_genome should specify its type and required fields for better validation.
   get_genome:
+    type: object
     properties:
       database:
         type: ["string", "null"]
+        enum: ["manual", "ncbi"]
       assembly:
         type: ["string", "null"]
+        pattern: "^GC[F|A]_[0-9]+\\.[0-9]+$"
       fasta:
         type: ["string", "null"]
       gff:
         type: ["string", "null"]
       gff_source_type:
         type: array
+        items:
+          type: string
+    required: ["database"]
21-28: Add constraints for numeric fields in simulate_reads.

The schema for simulate_reads should include minimum/maximum constraints for numeric fields to prevent invalid values.
   simulate_reads:
+    type: object
     properties:
       read_length:
         type: number
+        minimum: 1
+        maximum: 1000
       read_number:
         type: number
+        minimum: 1
       random_freq:
         type: number
+        minimum: 0
+        maximum: 1
+    required: ["read_length", "read_number", "random_freq"]
30-37: Add pattern validation for adapter sequences.

The schema for cutadapt should include pattern validation for adapter sequences to ensure they contain valid nucleotide sequences.
   cutadapt:
+    type: object
     properties:
       threep_adapter:
         type: string
+        pattern: "^-a [ATCG]+$"
       fivep_adapter:
         type: string
+        pattern: "^-A [ATCG]+$"
       default:
         type: array
+        items:
+          type: string
+    required: ["threep_adapter", "fivep_adapter"]
.github/workflows/main.yml (1)
49-51: Consider adding test artifacts retention.

Consider adding artifact retention for test results to help with debugging pipeline failures.
          directory: .test
          snakefile: workflow/Snakefile
          args: "--use-conda --show-failed-logs --cores 3 --conda-cleanup-pkgs cache"
+      - uses: actions/upload-artifact@v4
+        if: always()
+        with:
+          name: test-results
+          path: .test/results
+          retention-days: 14
config/README.md (1)
73-73: Fix code span formatting.

Remove extra spaces inside code spans to comply with Markdown formatting rules.
-[`-q 10 `, `-m 25 `, `-M 100`, `--overlap=5`]
+[`-q 10`, `-m 25`, `-M 100`, `--overlap=5`]
🧰 Tools

🪛 Markdownlint (0.37.0)

73-73: null
Spaces inside code span elements

(MD038, no-space-in-code)

73-73: null
Spaces inside code span elements

(MD038, no-space-in-code)
README.md (1)
117-122: Update list formatting for consistency.

Convert asterisks to dashes for consistent unordered list formatting.
-* Replace `<owner>` and `<repo>` everywhere in the template
-* Replace `<name>` with the workflow name
-* Replace `<description>` with a description
-* Update the workflow description
-* Update the `README.md` badges
+- Replace `<owner>` and `<repo>` everywhere in the template
+- Replace `<name>` with the workflow name
+- Replace `<description>` with a description
+- Update the workflow description
+- Update the `README.md` badges
🧰 Tools

🪛 LanguageTool

[style] ~119-~119: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w name (can be the same as <repo>). * Replace <description> with a description of w...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 Markdownlint (0.37.0)

117-117: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

118-118: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

119-119: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

120-120: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

121-121: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

122-122: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)
config/multiqc_config.yml (1)
1-2: Add newline at end of file.

Add a newline character at the end of the file to comply with YAML formatting standards.
 remove_sections:
   - samtools-stats
+
🧰 Tools

🪛 yamllint (1.35.1)

[error] 2-2: no new line character at the end of file

(new-line-at-end-of-file)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c7a73eb and 7a3a40e.

⛔ Files ignored due to path filters (1)

.test/config/samples.tsv is excluded by !**/*.tsv

📒 Files selected for processing (21)

.github/workflows/conventional-prs.yml (2 hunks)
.github/workflows/main.yml (2 hunks)
.github/workflows/release-please.yml (1 hunks)
.gitignore (1 hunks)
.test/config/config.yml (1 hunks)
.test/config/multiqc_config.yml (1 hunks)
README.md (1 hunks)
config/README.md (1 hunks)
config/config.yml (1 hunks)
config/multiqc_config.yml (1 hunks)
config/schemas/config.schema.yml (1 hunks)
config/schemas/samples.schema.yml (1 hunks)
workflow/Snakefile (1 hunks)
workflow/envs/cutadapt.yml (1 hunks)
workflow/envs/fastqc.yml (1 hunks)
workflow/envs/get_genome.yml (1 hunks)
workflow/envs/multiqc.yml (1 hunks)
workflow/envs/simulate_reads.yml (1 hunks)
workflow/rules/common.smk (1 hunks)
workflow/rules/process_reads.smk (1 hunks)
workflow/scripts/get_genome.py (1 hunks)

✅ Files skipped from review due to trivial changes (7)

workflow/envs/multiqc.yml
workflow/envs/get_genome.yml
workflow/envs/fastqc.yml
workflow/envs/simulate_reads.yml
.gitignore
.github/workflows/release-please.yml
workflow/envs/cutadapt.yml

🧰 Additional context used

🪛 yamllint (1.35.1)

.test/config/multiqc_config.yml

[error] 2-2: no new line character at the end of file

(new-line-at-end-of-file)

config/multiqc_config.yml

[error] 2-2: no new line character at the end of file

(new-line-at-end-of-file)

🪛 LanguageTool

config/README.md

[style] ~81-~81: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w name (can be the same as <repo>). * Replace <description> with a description of w...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

README.md

[uncategorized] ~53-~53: You might be missing the article “the” here.
Context: ...### Execution To run the workflow from command line, change the working directory. ``...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)

[style] ~119-~119: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w name (can be the same as <repo>). * Replace <description> with a description of w...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 Markdownlint (0.37.0)

config/README.md

73-73: null
Spaces inside code span elements

(MD038, no-space-in-code)

73-73: null
Spaces inside code span elements

(MD038, no-space-in-code)

README.md

117-117: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

118-118: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

119-119: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

120-120: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

121-121: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

113-113: null
Bare URL used

(MD034, no-bare-urls)

100-100: null
Spaces inside code span elements

(MD038, no-space-in-code)

100-100: null
Spaces inside code span elements

(MD038, no-space-in-code)

🪛 Ruff (0.8.2)

workflow/scripts/get_genome.py

17-17: Undefined name snakemake

(F821)

18-18: Undefined name snakemake

(F821)

19-19: Undefined name snakemake

(F821)

20-20: Undefined name snakemake

(F821)

21-21: Undefined name snakemake

(F821)

22-22: Undefined name snakemake

(F821)

23-23: Undefined name snakemake

(F821)

24-24: Undefined name snakemake

(F821)

29-29: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

29-29: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

41-41: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

41-41: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

47-47: f-string without any placeholders

Remove extraneous f prefix

(F541)

55-55: Undefined name snakemake

(F821)

61-61: Test for membership should be not in

Convert to not in

(E713)

72-72: Test for membership should be not in

Convert to not in

(E713)

107-107: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

🪛 actionlint (1.7.4)

.github/workflows/main.yml

29-29: the runner of "actions/checkout@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🔇 Additional comments (6)

workflow/rules/common.smk (1)

16-17: 🛠️ Refactor suggestion

Verify schema paths and add error handling.

The schema paths are using relative paths (../../config/schemas/). This could be problematic if the working directory changes. Consider:

Using workflow.source_path() for reliable path resolution

Adding error handling for missing schema files
-validate(samples, schema="../../config/schemas/samples.schema.yml")
-validate(config, schema="../../config/schemas/config.schema.yml")
+validate(samples, schema=workflow.source_path("../config/schemas/samples.schema.yml"))
+validate(config, schema=workflow.source_path("../config/schemas/config.schema.yml"))
.test/config/multiqc_config.yml (1)

1-2: LGTM! Configuration looks valid.

The removal of samtools-stats section from the MultiQC report is properly configured.

🧰 Tools

🪛 yamllint (1.35.1)

[error] 2-2: no new line character at the end of file

(new-line-at-end-of-file)

.github/workflows/conventional-prs.yml (1)

Line range hint 1-20: LGTM! Workflow updates follow best practices.

Good improvements:

Explicit permissions following principle of least privilege

Updated action version for security

Clear job naming

config/config.yml (1)

1-27: Verify path consistency between test and production configs.

The paths differ between test and production configs:

Test config: config/samples.tsv

Production config: .test/config/samples.tsv

Test config: config/multiqc_config.yml

Production config: config/multiqc_config.yml

Please verify the correct paths and ensure consistency.

.github/workflows/main.yml (1)

40-43: Add missing job dependencies.

The Testing job should depend on both Linting and Formatting jobs to ensure proper execution order.

config/README.md (1)

3-82: Complete the workflow description placeholders.

Several placeholders need to be replaced with actual content:

<detailed description> on line 3

<name> in the title

🧰 Tools

🪛 LanguageTool

[style] ~81-~81: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w name (can be the same as <repo>). * Replace <description> with a description of w...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 Markdownlint (0.37.0)

73-73: null
Spaces inside code span elements

(MD038, no-space-in-code)

73-73: null
Spaces inside code span elements

(MD038, no-space-in-code)

coderabbitai · 2025-01-22T09:53:02Z

workflow/rules/process_reads.smk

+        "cutadapt {params.threep_adapter} {params.fivep_adapter} --cores {threads} "
+        "-o {output.fastq1} -p {output.fastq2} {input.fastq1} {input.fastq1} > {log.stdout} 2> {log.stderr}"


⚠️ Potential issue

Fix incorrect input file reference in cutadapt command.

The second input file reference is incorrect, causing the same file to be processed twice.

- "-o {output.fastq1} -p {output.fastq2} {input.fastq1} {input.fastq1} > {log.stdout} 2> {log.stderr}" + "-o {output.fastq1} -p {output.fastq2} {input.fastq1} {input.fastq2} > {log.stdout} 2> {log.stderr}"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

"cutadapt {params.threep_adapter} {params.fivep_adapter} --cores {threads} "

"-o {output.fastq1} -p {output.fastq2} {input.fastq1} {input.fastq1} > {log.stdout} 2> {log.stderr}"

"cutadapt {params.threep_adapter} {params.fivep_adapter} --cores {threads} "

"-o {output.fastq1} -p {output.fastq2} {input.fastq1} {input.fastq2} > {log.stdout} 2> {log.stderr}"

coderabbitai · 2025-01-22T09:53:02Z

workflow/scripts/get_genome.py

+def check_fasta(input_fasta, log=[], error=[]):
+    with open(input_fasta, "r") as fasta_file:
+        fasta = fasta_file.read()
+    n_items = fasta.count(">")
+    if n_items:
+        log += [f"Supplied fasta file '{input_fasta}' was found"]
+        log += [f"Supplied fasta file contains {n_items} items"]
+    else:
+        error += ["The supplied fasta file contains no valid entries starting with '>'"]
+    return fasta, log, error
+


⚠️ Potential issue

Fix mutable default arguments and improve code style.

The functions use mutable default arguments which can lead to unexpected behavior.

-def check_fasta(input_fasta, log=[], error=[]): +def check_fasta(input_fasta, log=None, error=None): + log = [] if log is None else log + error = [] if error is None else error -def check_gff(input_gff, log=[], error=[]): +def check_gff(input_gff, log=None, error=None): + log = [] if log is None else log + error = [] if error is None else error

Also applies to: 41-82

🧰 Tools

🪛 Ruff (0.8.2)

29-29: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

29-29: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

coderabbitai · 2025-01-22T09:53:02Z

workflow/scripts/get_genome.py

+input_database = snakemake.params["database"]
+input_assembly = snakemake.params["assembly"]
+input_fasta = snakemake.params["fasta"]
+input_gff = snakemake.params["gff"]
+output_path = snakemake.output["path"]
+output_fasta = snakemake.output["fasta"]
+output_gff = snakemake.output["gff"]
+output_log = snakemake.log["path"]


🛠️ Refactor suggestion

Add snakemake import and type hints.

The script uses the snakemake variable without importing it. Also, consider adding type hints for better code maintainability.

+from snakemake.script import Snakemake + +# Type hint for the snakemake object +snakemake: Snakemake

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

input_database = snakemake.params["database"]

input_assembly = snakemake.params["assembly"]

input_fasta = snakemake.params["fasta"]

input_gff = snakemake.params["gff"]

output_path = snakemake.output["path"]

output_fasta = snakemake.output["fasta"]

output_gff = snakemake.output["gff"]

output_log = snakemake.log["path"]

from snakemake.script import Snakemake

# Type hint for the snakemake object

snakemake: Snakemake

input_database = snakemake.params["database"]

input_assembly = snakemake.params["assembly"]

input_fasta = snakemake.params["fasta"]

input_gff = snakemake.params["gff"]

output_path = snakemake.output["path"]

output_fasta = snakemake.output["fasta"]

output_gff = snakemake.output["gff"]

output_log = snakemake.log["path"]

🧰 Tools

🪛 Ruff (0.8.2)

17-17: Undefined name snakemake

(F821)

18-18: Undefined name snakemake

(F821)

19-19: Undefined name snakemake

(F821)

20-20: Undefined name snakemake

(F821)

21-21: Undefined name snakemake

(F821)

22-22: Undefined name snakemake

(F821)

23-23: Undefined name snakemake

(F821)

24-24: Undefined name snakemake

(F821)

coderabbitai · 2025-01-22T09:53:02Z

workflow/scripts/get_genome.py

+if input_database.lower() == "ncbi":
+    ncbi_result = getoutput(
+        f"datasets summary genome accession {input_assembly} --as-json-lines | "
+        + "dataformat tsv genome --fields accession,annotinfo-release-date,organism-name"
+    )
+    if ncbi_result.startswith("Error"):
+        error += [ncbi_result]
+        error += [
+            "The supplied refseq/genbank ID was not valid. Example for correct input: 'GCF_000009045.1'"
+        ]
+    elif len(ncbi_result) == 0:
+        error += [
+            "The result from fetching NCBI genome data has zero length. Please check your internet connection!"
+        ]
+    else:
+        ncbi_genome = [
+            i.split("\t")
+            for i in ncbi_result.split("\n")
+            if not (i.startswith("New version") or i.startswith("Warning"))
+        ]
+        ncbi_genome = dict(zip(ncbi_genome[0], ncbi_genome[1]))
+        log += ["Found the following genome(s):"]
+        for k in ncbi_genome.keys():
+            log += ["{0}: {1}".format(k, ncbi_genome.get(k))]
+        refseq_id = ncbi_genome.get("Assembly Accession")
+        if not refseq_id.startswith("GCF_"):
+            error += ["The RefSeq ID '{0}' has no valid format.".format(refseq_id)]
+        ncbi_command = (
+            f"datasets download genome accession {refseq_id}"
+            + f" --filename {output_path}/database.zip --include genome,gff3; "
+            + f"cd {output_path}; unzip database.zip; rm database.zip"
+        )
+        copy_command = (
+            f"cp {output_path}/ncbi_dataset/data/{refseq_id}/*.fna {output_fasta}; "
+            + f"cp {output_path}/ncbi_dataset/data/{refseq_id}/genomic.gff {output_gff}"
+        )
+        index_command = f"samtools faidx {output_fasta}"
+        str_out = getoutput(ncbi_command)
+        str_cp = getoutput(copy_command)
+        # import and check files
+        fasta, log, error = check_fasta(output_fasta, log, error)
+        index_out = getoutput(index_command)
+        gff, log, error = check_gff(output_gff, log, error)
+        # write cleaned gff file
+        with open(output_gff, "w") as gff_out:
+            GFF.write(gff, gff_out)
+


🛠️ Refactor suggestion

Add timeout handling for NCBI API calls.

The script should handle potential timeouts when fetching data from NCBI.

+import requests.exceptions +from subprocess import getoutput, run, PIPE, TimeoutError +def execute_with_timeout(command, timeout=300): + try: + result = run(command, shell=True, stdout=PIPE, stderr=PIPE, timeout=timeout, text=True) + return result.stdout + except TimeoutError: + return "Error: Command timed out after {} seconds".format(timeout) - ncbi_result = getoutput( + ncbi_result = execute_with_timeout( f"datasets summary genome accession {input_assembly} --as-json-lines | " + "dataformat tsv genome --fields accession,annotinfo-release-date,organism-name" )

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 Ruff (0.8.2)

107-107: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

coderabbitai · 2025-01-22T09:53:02Z

.test/config/config.yml

+  gff_source_type:
+    [
+      "RefSeq": "gene",
+      "RefSeq": "pseudogene",
+      "RefSeq": "CDS",
+      "Protein Homology": "CDS",
+    ]


⚠️ Potential issue

Fix duplicate keys in gff_source_type configuration.

The current structure has duplicate "RefSeq" keys which will cause only the last entry to be used. This means the "gene" and "pseudogene" entries will be lost.

Restructure the configuration to preserve all entries:

gff_source_type: [ - "RefSeq": "gene", - "RefSeq": "pseudogene", - "RefSeq": "CDS", + {"source": "RefSeq", "type": "gene"}, + {"source": "RefSeq", "type": "pseudogene"}, + {"source": "RefSeq", "type": "CDS"}, "Protein Homology": "CDS", ]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

gff_source_type:

[

"RefSeq": "gene",

"RefSeq": "pseudogene",

"RefSeq": "CDS",

"Protein Homology": "CDS",

]

gff_source_type:

[

{"source": "RefSeq", "type": "gene"},

{"source": "RefSeq", "type": "pseudogene"},

{"source": "RefSeq", "type": "CDS"},

"Protein Homology": "CDS",

]

coderabbitai · 2025-01-22T09:53:02Z

.github/workflows/main.yml

-        directory: .
-        snakefile: workflow/Snakefile
-        args: "--lint"
+      - uses: actions/checkout@v2


⚠️ Potential issue

Update the checkout action version.

The runner of "actions/checkout@v2" action is outdated. Update to v4 for consistency with other jobs.

- - uses: actions/checkout@v2 + - uses: actions/checkout@v4

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

- uses: actions/checkout@v2

- uses: actions/checkout@v4

🧰 Tools

🪛 actionlint (1.7.4)

29-29: the runner of "actions/checkout@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

m-jahn and others added 8 commits November 29, 2024 10:42

feat: updated all GH actions to latest versions

4d7b3a2

feat: complete, reproducible example workflow

1dfa7ad

fix: updated schemas and params docs

facf377

fix: dont remove temp files for test runs

0c2c8d1

fix: lint error and docs update

cf59f11

fix: added usage docs

776b97e

Merge pull request #1 from MPUSP/dev

2348055

feat: complete minimal workflow as template

feat: updated github actions workflow

7a3a40e

coderabbitai bot reviewed Jan 22, 2025

View reviewed changes

rabioinf closed this Jan 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: update github actions workflow #9

feat: update github actions workflow #9

Uh oh!

rabioinf commented Jan 22, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Jan 22, 2025 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Jan 22, 2025

Uh oh!

coderabbitai bot Jan 22, 2025

Uh oh!

coderabbitai bot Jan 22, 2025

Uh oh!

coderabbitai bot Jan 22, 2025

Uh oh!

coderabbitai bot Jan 22, 2025

Uh oh!

coderabbitai bot Jan 22, 2025

Uh oh!

Uh oh!

		"cutadapt {params.threep_adapter} {params.fivep_adapter} --cores {threads} "
		"-o {output.fastq1} -p {output.fastq2} {input.fastq1} {input.fastq1} > {log.stdout} 2> {log.stderr}"

feat: update github actions workflow #9

feat: update github actions workflow #9

Uh oh!

Conversation

rabioinf commented Jan 22, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jan 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rabioinf commented Jan 22, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 22, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)